Large language models, behaviour and cognition: Making sense of the new black boxes with old tricks



Bennett Kleinberg

Tilburg University & University College London

IMT Lucca, 26 September 2024

A new black box?

Large language models

Remarkable potential (Kaddour et al. 2023) and danger (Mozes et al. 2023):

  • opinion manipulation (Hackenburg and Margetts 2024)
  • designing new chemical weapons (Urbina et al. 2022)
  • propagation of racist stereotypes (Hofmann et al. 2024)
  • reducing conspiracy beliefs (Costello, Pennycook, and Rand 2024)
  • the “silicon sampling” idea (Argyle et al. 2022)

This talk

  • we have entered a new era of artificial intelligence models
  • the key driver: computational power and hence larger models
  • focus of this talk: large language models (but applies to other AI models too)

What do we mean by LLMs?

Before large language models:

  • human language explained with interpretable models that combine symbolic elements (e.g., part-of-speech tags)
  • rule-based operations

The GPT-3 approach:

  • very different scientific paradigm
  • learning from the wild
  • no instruction (supervision), no syntax, no rules

GPT-3’s training data

What we know (Brown et al. 2020)

Dataset           Tokens (billions)   Weight in training mix
Common Crawl      410                 60%
Books1 + Books2   67                  16%
WebText2          19                  22%
Wikipedia         3                   3%



LLMs are autoregressive language models

Autoregression?

Consider the following examples:

My name is ________

My name is Bond. ________

My name is Bond. James ________

Autoregression?



\(P(x_n | x_1, x_2, ..., x_{n-1})\)

Core idea: prompts = conditioning of the autoregressive probability function
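
To make the conditioning idea concrete, here is a minimal sketch (toy bigram counts, not GPT-3’s actual mechanism) of estimating \(P(x_n | x_{n-1})\) from data; a real LLM conditions on the entire prefix, not just the previous token:

```python
# Toy autoregression: estimate P(next token | previous token) from counts.
# GPT-style models condition on the whole prefix; a bigram model is the
# simplest instance of the same conditional-probability idea.
from collections import Counter, defaultdict

corpus = "my name is bond . james bond .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def p_next(context):
    counts = follows[context]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(p_next("name"))  # {'is': 1.0}
print(p_next("bond"))  # {'.': 1.0} -- the "prompt" conditions the distribution
```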


What makes this so difficult to investigate, then?

Why does this not work?

A statistical model (here: linear regression):

\(y_i = \alpha + \beta_1x_{i1} + \epsilon_i\)

This model has two parameters: \(\alpha\) and \(\beta_1\)

But LLMs have billions of parameters.
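
For contrast, a hedged sketch (simulated data) of how directly inspectable a two-parameter model is, next to the sheer scale of GPT-3:

```python
# A two-parameter regression is fully inspectable; an LLM is not.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=100)  # true alpha=1.5, beta=2.0

X = np.column_stack([np.ones_like(x), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(alpha_hat, beta_hat)  # two numbers, each directly interpretable

# GPT-3: 175 billion parameters. Inspecting one per millisecond would
# take more than five years.
print(175e9 / 1000 / (60 * 60 * 24 * 365))  # ~5.5 (years)
```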

What is “large” then?

GPT-3:

  • 175 billion parameter model trained on 300 billion tokens (Brown et al. 2020)
  • training data size: 45 TB of text data
  • its sheer scale was highly controversial when announced

Two big problems

Problem 1:

New AI models are quickly adopted (too quickly?) with

  • little to no understanding of them
  • far-reaching implications of their use (Mozes et al. 2023; see also the Data Skeptic podcast episode)
  • no framework to assess them

Two big problems

Problem 2:

Analytical approaches are inevitably infeasible:

  • new scale of model size
  • put differently: we cannot just look at model parameters
  • remember the Human Brain Project

The need for a different approach

  • a new black box
  • no need to reinvent the wheel

We already know how to study black boxes.

Old tricks for a new black box

Two recent ideas

  • Machine Behaviour (Rahwan et al. 2019)
  • Machine Psychology (Hagendorff et al. 2023)

One is rooted in behaviourism; the other more in modern cognitive science.

Let’s look back a few decades

Behaviourism

Central argument:

Behaviour is what organisms do.

Psychology as a “pure branch of the natural sciences” (Watson 1913)

\(\rightarrow\) Focus on the observable (Skinner 1935)

Key idea

  1. Psychology is the science of behaviour (methodological behaviourism)
  2. Behaviour can be studied without reference to “inner processes” (psychological behaviourism)
  3. All theories about inner processes should be eliminated or rephrased (analytical behaviourism)

Two fundamental ideas

Learning: change in behaviour based on experience

  • classical conditioning (Pavlov, late 19th century) (Yerkes and Morgulis 1909)
  • operant conditioning (Thorndike, Skinner, early to mid 20th century)

Classical conditioning

In summary

  • stimulus-response chain
  • core idea: conditioned vs unconditioned stimuli
    • food (unconditioned) and bell (neutral → conditioned) as stimuli
    • salivation remains the unconditioned response

Also called: respondent conditioning

Operant conditioning

Builds on classical conditioning.

But: focuses on how consequences affect voluntary behaviour

Key idea

  • Reinforcement: increase likelihood of behaviour
  • Punishment: decrease likelihood of behaviour
  • Both in two forms:
    • negative (removal of stimulus)
    • positive (addition of stimulus)

Also called instrumental conditioning
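
A minimal simulation of the reinforcement logic (hypothetical numbers, assuming a simple propensity-matching learner):

```python
# Toy operant conditioning: positive reinforcement (adding a reward)
# increases the propensity of the reinforced behaviour.
import random

random.seed(1)
propensity = {"press_lever": 1.0, "ignore_lever": 1.0}

def choose_action():
    actions = list(propensity)
    return random.choices(actions, weights=[propensity[a] for a in actions])[0]

for trial in range(100):
    if choose_action() == "press_lever":
        propensity["press_lever"] += 0.5  # food delivered: positive reinforcement
    # punishment would instead subtract from a propensity

p = propensity["press_lever"] / sum(propensity.values())
print(round(p, 2))  # pressing has become far more likely than at the start
```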

The operant conditioning chamber

Known as the Skinner box:

In a nutshell

Behaviourism studies psychological events in terms of behavioural criteria.

  • psychological hypotheses require behavioural evidence
  • two “states of mind” are only different if there is a difference in behaviour

\(\rightarrow\) mental states are deemed irrelevant

The behaviourist model

… this was a successful and dominant paradigm in psychology



So what happened?

The cognitive revolution

  1. Tolman’s maze (Edward C. Tolman 1948; E. C. Tolman and Honzik 1930)
  2. Shannon and Turing (Turing 2009) on principles of computation
  3. Miller’s magic number (George A. Miller 2003; George A. Miller 1956)
  4. Chomsky’s work on linguistics (Chomsky 2002)

Tolman’s experimental design

From Edward C. Tolman (1948)

The behaviourist’s view on learning

  • all learning occurs through interaction with the environment
  • all learning can be explained by conditioning (classical and operant)

Experimental groups

  1. No rewards
  2. Rewards

A third experimental group

  3. Delayed rewards (after 10 days)

What does the behaviourist predict?

Findings

Implications

  • the reward and no-reward groups aligned with behaviourist predictions

But the delayed reward group:

There must have been some learning in days 1-10!



This is called latent learning.

This can only be accommodated in the cognitive paradigm.

The cognitive model

Fast forward to now…

Are we going through the same development from 100 years ago?

  1. Behaviourism \(\rightarrow\) Machine Behaviour
  2. [WHAT IS MISSING?]
  3. Cognitive science \(\rightarrow\) Machine Psychology

Machine Behaviour

[…] studying machine behaviour does not imply that AI algorithms necessarily have independent agency nor does it imply algorithms should bear moral responsibility for their actions. If a dog bites someone, the dog’s owner is held responsible. (Rahwan et al. 2019, 483)

[…] machines exhibit behaviours that are fundamentally different from animals and humans, so we must avoid excessive anthropomorphism and zoomorphism. Even if borrowing existing behavioural scientific methods can prove useful for the study of machines, machines may exhibit forms of intelligence and behaviour that are qualitatively different—even alien—from those seen in biological agents. (Rahwan et al. 2019, 483)

Machine Psychology

LLMs are, like the human brain, black boxes to some extent. (Hagendorff et al. 2023, 9)

Concepts such as reasoning, intuition, creativity, intelligence, personality, mental illness, etc. are transferred into LLMs. (Hagendorff et al. 2023, 9)

Machine Psychology

Three research challenges

This allows us to embrace three perspectives on the artificial mind

  1. testing (analogy: experimental psychology)
  2. measuring (analogy: psychometrics)
  3. mapping (analogy: neuroscience)

Testing the artificial mind

Work on LLMs vs humans

From Jakesch, Hancock, and Naaman (2023)

Fresh work from our lab

Two potential confounds:

  1. humans had no incentive to convey humanness
  2. the tasks did not allow human qualities to shine through

Our context: relationship advice task

Human example

So you have managed to stay together for twenty years and now your relationship is threatened by the opportunity to move to another continent. My immediate opinion is that you should be able to talk this out, agree on a reasonable compromise which will allow you to stay together. A little give and take is needed, probably. This opportunity seems too good to turn down so it is really a question of working out a way of doing it which is agreeable to both. I suggest that the one who is against the move sits down with the other partner and calmly lay out their objections. These are serious objections and have to be considered. But there must be ways of overcoming them. Consider how long the move is to last for. Is there room for compromise there? What about holidays? What about regular visits back to the homeland? […] It’s not easy and you have my best wishes.

LLM example

Hey there, I can only imagine how tough things must be for you right now, considering the crossroads you’re at with your partner. Twenty years is a long time to share your life with someone, and facing the possibility of a breakup over a move to another continent is heart-wrenching. First off, it’s crucial to approach this upcoming confrontation with a clear head. I know emotions are running high, but try to see this as an opportunity for an open, honest conversation rather than a battleground. Remember, it’s not you versus them; it’s both of you versus the problem. Here’s what I suggest: find a quiet, comfortable space where you can talk without interruptions. Start by expressing how much you value the years you’ve spent together and the love you share. It’s important to acknowledge the weight of your history together before diving into the current issue. Then, lay out your feelings and concerns about the move openly. […]

Findings

There was a twist

Experimental approach:

  • Modality (humans vs LLM)
  • Condition (adversarial vs naive)
  • Task (high in empathy vs low in empathy)

All texts (\(n_{Exp.1}=1060\) and \(n_{Exp.2}=1064\)) were evaluated by independent human assessors (from 1=“definitely AI-generated” to 5=“definitely human-written”).
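
To illustrate the design, a hedged analysis sketch with simulated ratings (the variable names and cell sizes are assumptions, not the study’s actual pipeline):

```python
# Simulated 2x2x2 design: Modality x Condition x Task, rated 1-5.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
rows = [{"modality": m, "condition": c, "task": t, "rating": int(rng.integers(1, 6))}
        for m in ("human", "llm")
        for c in ("naive", "adversarial")
        for t in ("high_empathy", "low_empathy")
        for _ in range(50)]
df = pd.DataFrame(rows)

# Full-factorial model; the key test is the modality:condition interaction
# (do only LLMs shift under adversarial instructions?).
fit = smf.ols("rating ~ modality * condition * task", data=df).fit()
print(fit.summary())
```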

Findings

For naive instructions

For adversarial instructions

Stochastic empathy

  • overall: humans > LLM
  • key finding in the interaction:
    • only LLMs respond to adversarial instructions (Exp. 1: +39.3%, Exp. 2: +41.0%)
    • humans were not able to become more human
  • no differences in the low-empathy task

How did the LLM do this?

Looking deeper

What does this tell us?

Insights in the black box

The LLM’s language - when instructed to be human - contained:

  • a more informal and conversational tone
  • more self-references
  • more references to the present
  • simpler vocabulary
  • fewer stiff greetings

The LLM relies on an implicit representation of empathy: stochastic empathy.
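
As an illustration of these markers, a rough feature-counting sketch (illustrative heuristics, not the study’s actual feature pipeline):

```python
# Crude proxies for two of the markers above: self-references and informality.
import re

def self_references(text):
    return len(re.findall(r"\b(?:i|me|my|mine|myself)\b", text.lower()))

def informal_markers(text):
    markers = ("hey", "okay", "stuff", "you know")
    t = text.lower()
    return sum(t.count(m) for m in markers)

llm_opener = "Hey there, I can only imagine how tough things must be for you."
human_opener = "My immediate opinion is that you should talk this out."

print(self_references(llm_opener), informal_markers(llm_opener))      # 1 1
print(self_references(human_opener), informal_markers(human_opener))  # 1 0
```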

Measuring the artificial mind

LLMs and psychometrics

“[Many of the constructs of interest] would be considered latent variables in psychological theory: these constructs are not directly observable nor directly measurable. Instead, these variables are indirectly measured through measurable behaviours hypothesised to be caused by the underlying latent trait.” (Peereboom, Schwabe, and Kleinberg, n.d.)

LLMs and latent variables

The argument:

Beyond comparing the sum scores of measurement instruments, we need to examine the latent structure if we assume that instruments designed for humans can be applied to LLMs, or that LLM data should show a human-like pattern.

Design

  • Personality tests: the HEXACO-60 and the Dark Side of Humanity scale (Ashton and Lee 2009; Katz et al. 2022)
  • Data from a representative human sample (\(n=401\)) and from LLMs (GPT-3.5, GPT-4, GPT-4 Turbo)

Analytical approach:

  1. Compare composite scores (sum scores)
  2. Compare latent structures (factor analysis)
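
A hedged sketch of both steps on simulated responses (the scale size follows the HEXACO-60; the random data are placeholders):

```python
# Step 1: composite (sum) scores; Step 2: latent structure via factor analysis.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
n_items, n_respondents = 60, 400  # HEXACO-60: 60 items, 6 expected factors

human = rng.integers(1, 6, size=(n_respondents, n_items)).astype(float)
llm = rng.integers(1, 6, size=(n_respondents, n_items)).astype(float)

# Step 1: similar sum scores can mask very different response structures.
print(human.sum(axis=1).mean(), llm.sum(axis=1).mean())

# Step 2: compare the loading patterns across sources.
loadings_human = FactorAnalysis(n_components=6).fit(human).components_
loadings_llm = FactorAnalysis(n_components=6).fit(llm).components_
print(loadings_human.shape, loadings_llm.shape)  # (6, 60) each
# In Peereboom et al., the LLM data did not recover the expected
# six-factor structure, even where the sum scores looked plausible.
```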

Findings

First signs of trouble

But this is only half the story

Let us look at latent structures:

  • Assumptions for a confirmatory factor analysis not met in LLM data!

What if we use the less restrictive exploratory factor analysis instead?

The expected factor structure

Human data

LLM factor structure

Peereboom et al.’s conclusion

Our findings suggest that questionnaires designed for humans do not measure similar latent constructs in LLMs, and that these latent constructs may not even exist in LLMs in the first place. […] A thorough psychometric evaluation is essential for studying LLM behaviour. It may help us decide which effects are worth pursuing, and which effects are cognitive phantoms. (Peereboom, Schwabe, and Kleinberg, n.d.)

Mapping the artificial mind

Neuroscience of LLMs?

Recent work on “monosemanticity” (Templeton et al. 2024; Bricken et al. 2023)

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole.

For the human brain: neurons.

For an LLM: monosemantic features.
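
In that spirit, a toy sparse-autoencoder sketch (numpy, simulated activations; the actual work trains on real residual-stream activations at vastly larger scale):

```python
# Toy sparse autoencoder: learn an overcomplete dictionary of features
# whose sparse activations reconstruct the model's internal activations.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat, n = 64, 256, 10_000

acts = rng.normal(size=(n, d_model))    # stand-in for real activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_feat))
W_dec = rng.normal(scale=0.1, size=(d_feat, d_model))

lr, l1 = 1e-3, 1e-3
for step in range(500):
    batch = acts[rng.choice(n, size=256)]
    f = np.maximum(batch @ W_enc, 0.0)       # sparse feature activations
    err = f @ W_dec - batch                  # reconstruction error
    grad_f = (err @ W_dec.T + l1) * (f > 0)  # L2 recon + L1 sparsity, ReLU-gated
    W_dec -= lr * (f.T @ err) / len(batch)
    W_enc -= lr * (batch.T @ grad_f) / len(batch)

# After training, each row of W_dec is a candidate feature direction;
# ideally each fires for one interpretable concept ("monosemanticity").
print(W_dec.shape)  # (256, 64)
```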

Visualisations

Monosemanticity demo

Excellent explainer: https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand

So what?

In conclusion

  • LLMs are here to stay and we need to understand them
  • Different approaches:
    • Machine Behaviour
    • Machine Psychology
    • “Machine Neuroscience”
  • All inspired by their parent disciplines

A look ahead

  • exciting new research problems
  • renaissance of methods (behaviourism \(\rightarrow\) cognitive science \(\rightarrow\) neuroscience)
  • let us avoid a renaissance of problems

This is where psychologists, cognitive scientists and neuroscientists can shine!

A category mistake?

Stochastic parrots! (Bender et al. 2021)

one important consequence of imprudent use of terminology in our academic discourse is that it feeds AI hype (Bender and Koller 2020, 5186)

Bigger and bigger models:

  • environmental costs (training one large transformer can emit roughly 57× the average human’s annual \(CO_2\) footprint of ~5t)
  • data, data, data
  • an LLM has no language understanding (Bender and Koller 2020) (see adversarial ML)

Adversarial machine learning

First demonstrated by Alzantot et al. (2018), and by many others since:

See also Mozes, Kleinberg, and Griffin (2022), Mozes, Bartolo, et al. (2021), Mozes, Stenetorp, et al. (2021)
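
A toy illustration of the attack idea (hypothetical classifier and synonym table; real attacks such as Alzantot et al.’s search embedding neighbours and query the target model):

```python
# Toy word-substitution attack: a meaning-preserving swap flips the prediction.
def toy_sentiment(text):
    neg = {"terrible", "awful", "bad"}
    return "negative" if any(w in text.split() for w in neg) else "positive"

synonyms = {"terrible": "horrific", "awful": "dreadful"}  # assumed lookup table

def attack(text):
    words = text.split()
    for i, w in enumerate(words):
        if w in synonyms:
            candidate = words[:i] + [synonyms[w]] + words[i + 1:]
            if toy_sentiment(" ".join(candidate)) != toy_sentiment(text):
                return " ".join(candidate)  # meaning preserved, label flipped
    return text

print(toy_sentiment("the movie was terrible"))  # negative
print(attack("the movie was terrible"))         # "the movie was horrific" -> positive
```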

“GPT-3 is a model of how words relate to one another, not a model of how language might relate to the perceived world.” (Gary Marcus)

Open question

How do we best study the mind-like processes of an AI model?

Thank you

If you’re interested in this work, please get in touch.

Tomorrow: practical session on LLMs in R (9:00h)

References

Alzantot, Moustafa, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. “Generating Natural Language Adversarial Examples.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2890–96. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1316.
Argyle, Lisa P, Ethan C Busby, Nancy Fulda, Joshua Gubler, Christopher Rytting, and David Wingate. 2022. “Out of One, Many: Using Language Models to Simulate Human Samples.” arXiv preprint arXiv:2209.06899.
Ashton, Michael C, and Kibeom Lee. 2009. “The HEXACO–60: A Short Measure of the Major Dimensions of Personality.” Journal of Personality Assessment 91 (4): 340–45.
Bender, Emily M, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23.
Bender, Emily M, and Alexander Koller. 2020. “Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–98.
Bricken, Trenton, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, et al. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Transformer Circuits Thread.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:1877–1901. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Chomsky, Noam. 2002. Syntactic Structures. Mouton de Gruyter. https://doi.org/10.1515/9783110218329.
Costello, Thomas H., Gordon Pennycook, and David G. Rand. 2024. “Durably Reducing Conspiracy Beliefs Through Dialogues with AI.” Science 385 (6714): eadq1814. https://doi.org/10.1126/science.adq1814.
Hackenburg, Kobi, and Helen Margetts. 2024. “Evaluating the Persuasive Influence of Political Microtargeting with Large Language Models.” Proceedings of the National Academy of Sciences 121 (24): e2403116121. https://doi.org/10.1073/pnas.2403116121.
Hagendorff, Thilo, Ishita Dasgupta, Marcel Binz, Stephanie C. Y. Chan, Andrew Lampinen, Jane X. Wang, Zeynep Akata, and Eric Schulz. 2023. “Machine Psychology.” arXiv. https://doi.org/10.48550/ARXIV.2303.13988.
Hofmann, Valentin, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024. “AI Generates Covertly Racist Decisions about People Based on Their Dialect.” Nature 633 (8028): 147–54. https://doi.org/10.1038/s41586-024-07856-5.
Jakesch, Maurice, Jeffrey T. Hancock, and Mor Naaman. 2023. “Human Heuristics for AI-Generated Language Are Flawed.” Proceedings of the National Academy of Sciences 120 (11): e2208839120. https://doi.org/10.1073/pnas.2208839120.
Kaddour, Jean, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. “Challenges and Applications of Large Language Models.” arXiv. http://arxiv.org/abs/2307.10169.
Katz, Louise, Caroline Harvey, Ian S. Baker, and Chris Howard. 2022. “The Dark Side of Humanity Scale: A Reconstruction of the Dark Tetrad Constructs.” Acta Psychologica 222 (February): 103461. https://doi.org/10.1016/j.actpsy.2021.103461.
Miller, George A. 2003. “The Cognitive Revolution: A Historical Perspective.” Trends in Cognitive Sciences 7 (3): 141–44. https://doi.org/10.1016/S1364-6613(03)00029-9.
Miller, George A. 1956. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information.” Psychological Review 63 (2): 81–97. https://doi.org/10.1037/h0043158.
Mozes, Maximilian, Max Bartolo, Pontus Stenetorp, Bennett Kleinberg, and Lewis Griffin. 2021. “Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 8258–70. Online; Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.651.
Mozes, Maximilian, Xuanli He, Bennett Kleinberg, and Lewis D. Griffin. 2023. “Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities.” arXiv. http://arxiv.org/abs/2308.12833.
Mozes, Maximilian, Bennett Kleinberg, and Lewis Griffin. 2022. “Identifying Human Strategies for Generating Word-Level Adversarial Examples.” Findings of EMNLP 2022.
Mozes, Maximilian, Pontus Stenetorp, Bennett Kleinberg, and Lewis Griffin. 2021. “Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples.” In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 171–86. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.13.
Peereboom, Sanne, Inga Schwabe, and Bennett Kleinberg. n.d. “Cognitive Phantoms in LLMs Through the Lens of Latent Variables.”
Rahwan, Iyad, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W Crandall, et al. 2019. “Machine Behaviour.” Nature 568 (7753): 477–86.
Skinner, B. F. 1935. “Two Types of Conditioned Reflex and a Pseudo Type.” The Journal of General Psychology 12 (1): 66–77. https://doi.org/10.1080/00221309.1935.9920088.
Templeton, Adly, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, et al. 2024. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Transformer Circuits Thread. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Tolman, E. C., and C. H. Honzik. 1930. “Introduction and Removal of Reward, and Maze Performance in Rats.” University of California Publications in Psychology 4: 257–75.
Tolman, Edward C. 1948. “Cognitive Maps in Rats and Men.” Psychological Review 55 (4): 189–208. https://doi.org/10.1037/h0061626.
Turing, Alan M. 2009. “Computing Machinery and Intelligence.” In Parsing the Turing Test, edited by Robert Epstein, Gary Roberts, and Grace Beber, 23–65. Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-1-4020-6710-5_3.
Urbina, Fabio, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. 2022. “Dual Use of Artificial-Intelligence-Powered Drug Discovery.” Nature Machine Intelligence 4 (3): 189–91. https://doi.org/10.1038/s42256-022-00465-9.
Watson, John B. 1913. “Psychology as the Behaviorist Views It.” Psychological Review 20 (2): 158–77. https://doi.org/10.1037/h0074428.
Yerkes, Robert M., and Sergius Morgulis. 1909. “The Method of Pawlow in Animal Psychology.” Psychological Bulletin 6 (8): 257–73. https://doi.org/10.1037/h0070886.